AITopics | dense prediction task

Twins: Revisiting the Design of Spatial Attention in Vision Transformers

Neural Information Processing Systems | Apr-25-2026, 19:59:46 GMT

Very recently, a variety of vision transformer architectures for dense prediction tasks have been proposed, and they show that the design of spatial attention is critical to their success in these tasks. In this work, we revisit the design of spatial attention and demonstrate that a carefully devised yet simple spatial attention mechanism performs favorably against the state-of-the-art schemes. As a result, we propose two vision transformer architectures, namely Twins-PCPVT and Twins-SVT. Our proposed architectures are highly efficient and easy to implement, involving only matrix multiplications that are highly optimized in modern deep learning frameworks. More importantly, the proposed architectures achieve excellent performance on a wide range of visual tasks, including image-level classification as well as dense detection and segmentation. The simplicity and strong performance suggest that our proposed architectures may serve as stronger backbones for many vision tasks. Our code is available at: https://git.io/Twins.

artificial intelligence, machine learning, transformer, (15 more...)

Genre: Research Report (0.68)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

FLSL: Feature-level Self-supervised Learning

Neural Information Processing Systems | Apr-25-2026, 04:38:57 GMT

Current self-supervised learning (SSL) methods (e.g., SimCLR, DINO, VICReg, MoCo v3) primarily target instance-level representations and do not generalize well to dense prediction tasks, such as object detection and segmentation. Towards aligning SSL with dense predictions, this paper demonstrates for the first time the underlying mean-shift clustering process of Vision Transformers (ViT), which aligns well with natural image semantics (e.g., a world of objects and stuff). By employing a transformer for joint embedding and clustering, we propose a bi-level feature clustering SSL method, coined Feature-Level Self-supervised Learning (FLSL). We present the formal definition of the FLSL problem and construct the objectives from the mean-shift and k-means perspectives. We show that FLSL promotes remarkable semantic cluster representations and learns an encoding scheme amenable to intra-view and inter-view feature clustering. Experiments show that FLSL yields significant improvements in dense prediction tasks, achieving 44.9 (+2.8)% AP and 46.5% AP in object detection, as well as 40.8 (+2.3)%
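The mean-shift view mentioned in this summary can be illustrated with a small sketch. It is not the FLSL objective or the authors' code. It assumes an exponential dot-product kernel, in which case one mean-shift update is the same computation as softmax self-attention with queries, keys, and values all set to the same features. The function name `mean_shift_step`, the `temperature` parameter, and the toy points are hypothetical.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def mean_shift_step(points, temperature=10.0):
    """One mean-shift update with an exponential dot-product kernel.

    Each point moves to the kernel-weighted mean of all points. The weights
    are a softmax over scaled dot products, as in self-attention where
    queries, keys, and values are the same features (an assumption made
    for illustration only).
    """
    dim = len(points[0])
    shifted = []
    for query in points:
        scores = [temperature * dot(query, p) for p in points]
        top = max(scores)  # subtract the maximum for numerical stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        shifted.append(
            [sum(w * p[k] for w, p in zip(weights, points)) / total
             for k in range(dim)]
        )
    return shifted


# Toy features: two loose clusters, near (1, 0) and near (0, 1).
points = [[1.0, 0.1], [0.9, 0.0], [0.0, 1.0], [0.1, 0.9]]
shifted = mean_shift_step(points)

spread_before = math.dist(points[0], points[1])
spread_after = math.dist(shifted[0], shifted[1])
print(spread_before, spread_after)
```

After one step, the two points in the first cluster sit closer together while the two clusters stay apart, which is the clustering behaviour the summary refers to.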

artificial intelligence, machine learning, representation, (15 more...)

Industry: Government (0.46)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Inductive Learning (0.91)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.46)

AiluRus: A Scalable ViT Framework for Dense Prediction

Neural Information Processing Systems | Feb-12-2026, 17:45:33 GMT

As a result, our method significantly accelerates ViTs for dense prediction tasks.

artificial intelligence, machine learning, natural language, (18 more...)

Country:

Asia > China > Shanghai > Shanghai (0.04)
Asia > China > Guangxi Province > Nanning (0.04)
Europe > France > Grand Est > Bas-Rhin > Strasbourg (0.04)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
(3 more...)

e6c2e85db1f1039177c4495ccd399ac4-Paper-Conference.pdf

Neural Information Processing Systems | Feb-12-2026, 12:57:34 GMT

arxiv preprint arxiv, transformer, vision transformer, (11 more...)

Country:

Europe > Switzerland > Zürich > Zürich (0.04)
Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)
Asia > China > Guangxi Province > Nanning (0.04)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.69)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.68)

83ccb398f3ce9c4d137011f36a03c7d4-Paper-Conference.pdf

Neural Information Processing Systems | Feb-10-2026, 09:15:43 GMT

We introduce point affiliation into feature upsampling, a notion that describes the affiliation of each upsampled point to a semantic cluster formed by local decoder feature points with semantic similarity. By rethinking point affiliation, we present a generic formulation for generating upsampling kernels. The kernels encourage not only semantic smoothness but also boundary sharpness in the upsampled feature maps. Such properties are particularly useful for some dense prediction tasks such as semantic segmentation. The key idea of our formulation is to generate similarity-aware kernels by comparing the similarity between each encoder feature point and the spatially associated local region of decoder features.

artificial intelligence, kernel, machine learning, (17 more...)

Country:

Asia > China (0.05)
Asia > Middle East > Republic of Türkiye > Karaman Province > Karaman (0.04)

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

4e0928de075538c593fbdabb0c5ef2c3-Paper.pdf

Neural Information Processing Systems | Feb-8-2026, 14:35:18 GMT

arxiv preprint arxiv, transformer, vision transformer, (12 more...)

Country: Oceania > Australia > South Australia > Adelaide (0.04)

Genre: Research Report (0.68)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

15212bd2265c4a3ab0dbc1b1982c1b69-Paper-Conference.pdf

Neural Information Processing Systems | Feb-8-2026, 04:36:48 GMT

artificial intelligence, machine learning, natural language, (15 more...)

Country:

North America > United States > Colorado (0.04)
Asia > Myanmar > Tanintharyi Region > Dawei (0.04)

Industry: Government (0.68)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
(2 more...)

3000311ca56a1cb93397bc676c0b7fff-Paper.pdf

Neural Information Processing Systems | Feb-7-2026, 23:45:07 GMT

learning, pixel, representation, (16 more...)

Country:

North America > Canada > Quebec > Montreal (0.04)
Asia > Japan > Honshū > Chūbu > Ishikawa Prefecture > Kanazawa (0.04)

Genre: Research Report (0.66)

Industry: Media (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Vision > Image Understanding (0.93)
(2 more...)

AiluRus: A Scalable ViT Framework for Dense Prediction

Neural Information Processing Systems | Dec-25-2025, 16:36:59 GMT

Vision transformers (ViTs) have emerged as a prevalent architecture for vision tasks owing to their impressive performance. However, their complexity increases dramatically when handling long token sequences, particularly for dense prediction tasks that require high-resolution input. Notably, dense prediction tasks, such as semantic segmentation or object detection, place more emphasis on the contours or shapes of objects, while the texture inside objects is less informative. Motivated by this observation, we propose to apply adaptive resolution to different regions of the image according to their importance. Specifically, at an intermediate layer of the ViT, we select anchors from the token sequence using the proposed spatial-aware density-based clustering algorithm. Tokens that are adjacent to anchors are merged to form low-resolution regions, while the others are preserved independently as high-resolution tokens.
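The anchor-and-merge idea in this summary can be sketched as follows. This is a simplified stand-in, not the paper's clustering algorithm. It assumes that a token's density is the number of neighbours that are close in both grid position and feature value, that anchors are picked greedily by density while staying spatially apart, and that merging means averaging features. All function names, radii, and the toy grid are hypothetical.

```python
import math


def local_density(i, positions, features, spatial_radius, feature_radius):
    """Count tokens near token i in both space and feature value."""
    return sum(
        1
        for j in range(len(positions))
        if j != i
        and math.dist(positions[i], positions[j]) <= spatial_radius
        and math.dist(features[i], features[j]) <= feature_radius
    )


def select_anchors(positions, features, num_anchors, spatial_radius, feature_radius):
    """Greedily pick high-density tokens that are spatially apart."""
    n = len(positions)
    dens = [local_density(i, positions, features, spatial_radius, feature_radius)
            for i in range(n)]
    anchors = []
    # Highest density first; ties are broken by token index for determinism.
    for i in sorted(range(n), key=lambda t: (-dens[t], t)):
        if len(anchors) == num_anchors:
            break
        if all(math.dist(positions[i], positions[a]) > spatial_radius for a in anchors):
            anchors.append(i)
    return anchors


def merge_tokens(positions, features, anchors, spatial_radius, feature_radius):
    """Merge tokens adjacent to an anchor; keep all other tokens as they are."""
    groups = {a: [a] for a in anchors}
    kept = []
    for i in range(len(positions)):
        if i in groups:
            continue
        best, best_d = None, None
        for a in anchors:
            d = math.dist(positions[i], positions[a])
            similar = math.dist(features[i], features[a]) <= feature_radius
            if d <= spatial_radius and similar and (best is None or d < best_d):
                best, best_d = a, d
        if best is None:
            kept.append(i)  # stays as an independent high-resolution token
        else:
            groups[best].append(i)
    dim = len(features[0])
    merged = {
        a: [sum(features[m][k] for m in members) / len(members) for k in range(dim)]
        for a, members in groups.items()
    }
    return merged, groups, kept


# Toy 4x4 token grid: a flat region (x < 3) and a varied right-hand column.
positions = [(x, y) for y in range(4) for x in range(4)]
features = [[0.0] if x < 3 else [1.0 + y] for (x, y) in positions]

anchors = select_anchors(positions, features, 2, 1.5, 0.5)
merged, groups, kept = merge_tokens(positions, features, anchors, 1.5, 0.5)
print(len(positions), "tokens ->", len(merged) + len(kept), "tokens")
```

In this toy grid the flat region collapses into two merged tokens while the four varied tokens in the right-hand column are kept, so the sequence shrinks from 16 tokens to 6.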

dense prediction task, name change, scalable vit framework, (6 more...)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.81)

SAPA: Similarity-Aware Point Affiliation for Feature Upsampling

Neural Information Processing Systems | Dec-24-2025, 15:42:57 GMT

We introduce point affiliation into feature upsampling, a notion that describes the affiliation of each upsampled point to a semantic cluster formed by local decoder feature points with semantic similarity. By rethinking point affiliation, we present a generic formulation for generating upsampling kernels. The kernels encourage not only semantic smoothness but also boundary sharpness in the upsampled feature maps. Such properties are particularly useful for some dense prediction tasks such as semantic segmentation. The key idea of our formulation is to generate similarity-aware kernels by comparing the similarity between each encoder feature point and the spatially associated local region of decoder features. In this way, the encoder feature point can function as a cue to inform the semantic cluster of upsampled feature points. To embody the formulation, we further instantiate a lightweight upsampling operator, termed Similarity-Aware Point Affiliation (SAPA), and investigate its variants. SAPA yields consistent performance improvements on a number of dense prediction tasks, including semantic segmentation, object detection, depth estimation, and image matting. Code is available at: https://github.com/poppinace/sapa
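The key idea described above, comparing each encoder feature point with a local window of decoder features to build an upsampling kernel, can be sketched in one dimension. This is not the SAPA operator from the linked repository. It assumes plain dot-product similarity and softmax normalization, and the function name `similarity_aware_upsample_1d` and its `scale` and `radius` parameters are hypothetical.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def similarity_aware_upsample_1d(encoder, decoder, scale=2, radius=1):
    """Upsample low-resolution decoder features, guided by encoder features.

    For each high-resolution position, the kernel weights come from the
    similarity between the encoder feature at that position and the decoder
    features in the spatially associated local window.
    """
    assert len(encoder) == scale * len(decoder)
    dim = len(decoder[0])
    upsampled = []
    for i, enc_point in enumerate(encoder):
        center = i // scale  # low-resolution location tied to position i
        lo = max(0, center - radius)
        hi = min(len(decoder), center + radius + 1)
        window = decoder[lo:hi]
        kernel = softmax([dot(enc_point, d) for d in window])
        upsampled.append(
            [sum(w * d[k] for w, d in zip(kernel, window)) for k in range(dim)]
        )
    return upsampled


# Toy input: two semantic clusters with a boundary in the middle.
decoder = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
encoder = [[5.0, 0.0]] * 4 + [[0.0, 5.0]] * 4

upsampled = similarity_aware_upsample_1d(encoder, decoder)
for vec in upsampled:
    print([round(v, 3) for v in vec])
```

In this toy input, the positions on either side of the boundary keep almost all of their weight on their own cluster, even though their windows overlap the other cluster. That is the boundary sharpness the summary describes.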

feature point, name change, similarity-aware point affiliation, (8 more...)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (0.64)
Information Technology > Artificial Intelligence > Vision (0.60)
